References:
In essence, the road to machine learning starts with regression. Some typical use cases of machine learning in marketing are:
In its core, these tasks all seek to learn from data. In order to do so, we use a given set of features to train an algorithm and extract information we need. These algorithms, known as learners too, can be divided according to the amount and type of supervision needed during training. We distinguish supervised learners which construct predictive models, and unsupervised learners which build descriptive models. Which type you will need to use depends on the learning task you hope to accomplish.
In this chapter we will focus on supervised machine learning, more specifically, on one method called extreme gradient boosting.
When we talk about machine learning, there are two quite important factors that drive successful application:
- effective statistical models that are capable of capturing the complex data dependencies
- and how scalable are learning systems that learn the model from large datasets.
Among the machine learning methods commonly used in practice are gradient tree boosting methods. Gradient boosting machines (GBMs) are a very popular machine learning method, and in this chapter we will introduce you R package “xgboost” and show how it can be used for marketing purposes. It is a scalable machine learning system for tree boosting and GMBs have proven successful across many domains and are one of the leading methods you can find across Kaggle competitions.
When it comes to marketing, it can be used in uplift modeling, i.e. can help a company to identify those who are likely to buy products as a result of receiving a discount or a personalized advertisement. Consequently, it helps a company to maximize profits by keeping advertising costs and overall efforts to the minimum. In the perspective of data analysis and marketing, performance of a marketing campaign can be predicted using algoritham such as GBMs. For instance, in the banking industry optimizing targeting for telemarketing used to be one of the main issues, especially under a growing pressure induced by financial crisis in 2008. A commercial bank from Portugal used data-driven model to predict the result of a telemarketing phone call to sell long term deposits. It was a valuable tool to support client selection decisions made by bank managers. As a result, they identified that inbound calls and an increase in other, highly relevant attributes (such as agent experience or duration of pre-vious calls) enhance the probability for a successful deposit sell.
XGBoost is used in supervised machine learning. Let us decompose the meaning of supervised machine learning.
Supervised machine learning can be described as a process in which training data with multiple features (also called: predictor variables, independent variables, attributes, predictor) is used to predict a target variable (also called, dependent variable, response, outcome measurement).
The final outcome of supervised machine learning is a predictive model, so it is important to define what a model is. A model, in the context of supervised machine learning, contains a mathematical structure or algorithm by which the prediction of a target variable is made from the multiple features used as input. For instance, algorithm that helps you to predict the sale price of your house based on the house attributes. Another important term in the context of machine learning are parameters. They denote undetermined part that we need to learn from data.
Finally, as we said, models we build with supervised machine learning are predictive models. “Supervised” refers to a supervisory role of the target variable, which indicates the task that model needs to learn. More specifically, it means that the data which is used for training a model contains target variable. Given a set of training data, the learning algorithm attempts to find the combination of feature values that results in a predicted value as close to the actual target variable as possible.
Boosting can be explained as a sequential process. That means that at each particular iteration, a new weak model is trained with respect to the error of the whole ensemble learned by that time. A weak model is one whose predictions (error rate) are only slightly better than random guessing. In simple words, in each iteration a better model is created by adding a new weak model to the existing one, where the purpose of the weak model is to slightly improve the remaining errors of the existing model. This process slowly learns from data and tries to improve its prediction in subsequent iterations.
Among other, boosting is used for solving both regression and classification problems.
Below you can find an illustrative example and explanation for each of them.
In the illustration above you can see 4 boxes with pluses and minuses within them representing observations. The ultimate goal of a model we need to develop is of classification nature, i.e. to classify pluses (“+”) and minuses (-) within a box as accurate as possible.
It is important to mention that at the begining all observations are assigned equal weights. However, weights are subject to change after each iteration as misclassified observations in one iteration will be assigned higher weight in the next one. Opposingly, observations that are correctly classified in one iteration will be assigned lower weight in the subsequent iteration.
In the box 1, the first weak learner identified “+” signs just on the left side of the box. It simply misclassified three “+” signs in the middle “-” upper part and recognized only the two on the left side. Consequently, it split the box in two parts (blue and light red), meaning that everything that appears in the blue “-” marked area is classified as “+”, while the rest is classified as “-”.
Although our prediction model at this point does not do great job, it contains information useful for the next weak learner that is being added in the box 2. Next weak learner assigns more weight to three “+” signs that were previously misclassified. Similarly to the previous split, the weak learner split the box 2 again in blue “-” and red “-” marked area. Again, everything in the blue area (left from the splitting line) was classified as “+”, including three minus signs being misclassified. The rest was classified as -". Even though our predicting model looks a bit better, its classification is still incorrect.
In the box 3, our model is becoming even better in classifying. It splitted the box horizontally, so that everything below the line was classified as “-” and above the line as “+”. Despite the progress we still have some missclassified “-” in the blue-marked area as well as wrongly classified “+” below the splitting line (circled signs in the box 3).
Finally, in the box 4 we see the result from combining information obtained from numerous weak learners. It is a weighted combination of the weak classifiers resulting in a strong classifier.Each classifier individually proved pretty poor performance in predicting, as they all show certain misclassification error. However, after combining them, the ultimate goal to classify all points correctly is reached and strong classifier created.
To understand the whole concept easier, try to follow the image above. On the one hand, the blue curve depicts the real underlying function, while the points depict observations. Moreover, observations include some noise, i.e. errors. On the other hand, you can observe red curve representing constantly improving boosted prediction. More specifically, it illustrates the adjusted predictions after each additional sequential tree is added to the algorithm. At the beginning, you can observe large errors (.i.e. big deviation of the red curve from the blue one) which the boosted algorithm reduces pretty fast. However, as the predictions (.i.e. red curve) get closer to the true underlying function (i.e.blue curve), the contribution to model improvement of each additional tree is smaller and smaller. In the end, the predicted values nearly match to the true underlying function.
There is an extensive list of packages with GBMs and its variations. However, the most popular implementations which we will cover here is certainly xgboost, which is quite fast and efficient.
XGboost stands for “Extreme Gradient Boosting”, which is an efficient implementation of gradient boosting framework. The xgboost package has been quite popular and successful on Kaggle for data mining competitions.
# Turn off scientific notation
options(scipen = 9999)
# Helper packages
library(dplyr) # for general data wrangling needs
# Modeling packages
library(xgboost) # for fitting extreme gradient boosting
library(rsample) # for split of data set in the training data and test data
library(AmesHousing)# data set
library(caret) # for resampling and model training
library(plotly)
library(recipes)
library(pdp)
library(knitr)
library(gbm)
library(mlr)
library(ggplot2)
To explain gradient boosting with xgboost, we will use typical Ames Iowa Housing data set and try to build a model that predict sale price for houses.
# Ames housing data
ames <- AmesHousing::make_ames()
# Ensure correct naming
library(janitor)
ames<- ames%>%clean_names()
# Use mlr package to get an overview of your data
knitr::kable(head(summarizeColumns(ames)))
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| ms_sub_class | factor | 0 | NA | 0.6317406 | NA | NA | 1 | 1079 | 16 |
| ms_zoning | factor | 0 | NA | 0.2242321 | NA | NA | 2 | 2273 | 7 |
| lot_frontage | numeric | 0 | 57.64778 | 33.4994408 | 63.0 | 25.2042 | 0 | 313 | 0 |
| lot_area | integer | 0 | 10147.92184 | 7880.0177594 | 9436.5 | 3024.5040 | 1300 | 215245 | 0 |
| street | factor | 0 | NA | 0.0040956 | NA | NA | 12 | 2918 | 2 |
| alley | factor | 0 | NA | 0.0675768 | NA | NA | 78 | 2732 | 3 |
First, we need to deal with data preparation. When using xgboost package, it is necessary to convert the categorical variables into numeric using one hot encoding.
What is one hot encoding? Usually, when between several categories exist ordinal relationship (e.g. variable “place” can be “1st”, “2nd” and so on), all you need to do is so called the integer encoding. In the ordinal variable (such as “place”) the integer values have a natural ordered relationship between each other, so machine learning algorithms are able to understand this relationship. However, for categorical variables where no ordinal relationship exists (e.g. variable “pet” with “dog”, “cat” and “rabbit”), the integer encoding is not sufficient and you need one hot encoding. In the “pet” example, there are 3 categories, thus 3 binary variables are required. Therefore, “1” value is placed in the binary variable for the respective “color” and “0” values for the other colors.
cat(tabl)
| dog | cat | rabbit | |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
We can apply one hot encoding to our data set by using R’s base function model.matrix. In the code below, ~.+0 leads to encoding of all categorical variables without producing an intercept.
$contrasts
unordered ordered
"contr.treatment" "contr.poly"
# One hot encoding (turning the test data into matrix with all numerical values)
ames.he <- model.matrix(~.+0,data = ames)
# Save variable names due to more practical addressing columns later on
setcol <- colnames(ames.he)
Next,we need to break our data set into training and test data, while ensuring we have consistent distributions between the training and test sets.
# Data split on test and train data
set.seed(1234)
# Create partition
library(caret)
index <- caret::createDataPartition(ames$sale_price, p = 0.7, list = F)
index is a matrix with just one column that contains approximately 70% of rows (i.e. numbers of rows) from our original data set.
Now we use index to address columns we want to assign to our train data (ames_train), and columns that we want to assign to our test data (ames_test).
Note that by using “-” in front of index we assign to ames_test all observations except those that are in index.
# Split data set in train and test data
ames_train <- ames.he[index, ]
ames_test <- ames.he[-index, ]
If we take lake look at the dimensions of our test and train data, we can see that our split was successful.
dim(ames.he) # 2930 observations in total
dim(ames_train)# 2053 observations for training data
dim(ames_test) # 877 observations for testing data
We are still not done with preparation of data. Since our task is to build a predictive model for house pricing based on multiple features, our target variable (sale price) needs to be excluded from the test data. The “real” target variable data (the real sales price) will be used at the end when we test accuracy of our predictive model.
# Test data
## Matrix containing all columns from the test data except dependent variable "Sale_Price"
colnames(ames_test[,300:308])# "sale_price" is a column number 308
[1] "sale_typeOth" "sale_typeVWD" "sale_typeWD "
[4] "sale_conditionAdjLand" "sale_conditionAlloca" "sale_conditionFamily"
[7] "sale_conditionNormal" "sale_conditionPartial" "sale_price"
# Addressing "Sale Price" column in matrix and excluding it
ames_x_test <- as.matrix(ames_test[,-308])
# No "Sale Price" anymore here! It used to be among last columns.
colnames(ames_x_test)[300:308]
[1] "sale_typeOth" "sale_typeVWD" "sale_typeWD "
[4] "sale_conditionAdjLand" "sale_conditionAlloca" "sale_conditionFamily"
[7] "sale_conditionNormal" "sale_conditionPartial" "longitude"
For training purposes, target variable (sale price) needs to be excluded from the rest of features, but not totally from the train data set.
# Train data set
# Matrix containing all columns from the train data except dependent variable "Sale_Price"
colnames(ames_train[,300:308])# "sale_price" is a column number 308
[1] "sale_typeOth" "sale_typeVWD" "sale_typeWD "
[4] "sale_conditionAdjLand" "sale_conditionAlloca" "sale_conditionFamily"
[7] "sale_conditionNormal" "sale_conditionPartial" "sale_price"
# Addressing "Sale Price" column in matrix and excluding it
ames_x_train <- as.matrix(ames_train[,-308])
# No "Sale Price" anymore here!
colnames(ames_x_train[,300:308])
[1] "sale_typeOth" "sale_typeVWD" "sale_typeWD "
[4] "sale_conditionAdjLand" "sale_conditionAlloca" "sale_conditionFamily"
[7] "sale_conditionNormal" "sale_conditionPartial" "longitude"
Therefore, the target variable is going to be stored separately because the learning algorithm in a predictive model attempts to discover and model the relationships among the target variable and the other features.
# Dependent/Target variable "Sales_Price" from the train data in a from of a vector
ames_y_train <- ames_train[,308]
In order to create a good predictive model, usually the most of time is spent optimizing parameters. Before we start training our model, let us take a closer look at what parameters we need to handle.
There are 3 categories XGBoost parameters can be divided into:
General parameters will not be discussed in further details, but it consists of 3 parameters:
booster - determines the booster type (gbtree, gblinear or dart) to use. For regression, you can use any. By default it is gbtree (which we will use as well).nthread - refers to the number of cores activated when computing. By default it uses maximum cores available, which leads to the fastest computation.silent - refers to turning on (“1”) running messages in R console. By default “0” is set, so that console does not get flooded with messages.For general parameters we will be using default options.
Next, booster parameters control the performance of the selected booster(gbtree in our case). At this moment we will introduce just the main ones:
nrounds - set the maximum number of iterations.eta - stands for the learning rate. It determines the rate at which our model learns patterns in data. After every iteration, it shrinks the feature weights to reach the best optimum. Smaller learning rates lead to longer computation time. It is important to note that smaller learning rates should be supported by increasing number of iterations. Otherwise, the risk of reaching the optimum is more likely. Usually, it lies between 0.01 - 0.3.max_depth - which determines the maximum depth of each tree. Generally, it is stands that larger the depth, more complex the model and consequently higher chances of overfitting. 4.min_child_weight - minimum number of observations required in each terminal nodesubsample - percent of training data to sample for each treecolsample_bytrees - percent of columns to sample from for each treeearly_stopping_rounds - stopping the training model as soon as evaluation metric (for regression that is “RMSE”) does not improve for a given number of roundsFinally, learning task parameters define methods for the loss function and model evaluation:
objective- for linear regression it should be set to “reg:linear”.eval_metric - this parameter depends on objective. Here we set metrics used to evaluate a model’s accuracy on validation data. When “reg:linear” set as objective, default metric is RMSE.A package with useful tools for parameter optimization is mlr. It includes extensive list of parameters for any type of algorithm. We can take a look at list with parameters for regression and check parameters we just discussed.
# Parameters for regression
getParamSet("regr.xgboost")
To build a well-performing predictive model many iterations are necessary. Therefore, in order to determine how good or bad one model predicts the target variable, performance evaluation needs to be conducted. A technique that will be used to help us in evaluating performance of our future machine learning models is called k-fold cross-validation technique. K-fold cross-validation evaluates a model by training a couple of models on subsets of the available input data and evaluating them on the complementary subset of the data. In this process, training data is split into k groups (i.e. folds) of approximately equal size. Then the model is fit on k−1 folds and the remaining fold is used in computation of the model performance. This procedure is repeated k times, where each time, a different fold is treated as the validation set (i.e. used in computation of the model performance). Thus, the final cross-validation k-fold estimate is computed by averaging the k test errors. The final output provided is an approximation of the error we may expect on unseen data.
The first model to pass to the k-fold cross validation will be built using default parameters. As default for number of iterations (nrounds) is zero, we will set it on 200.
set.seed(1234)
ames_xgb <- xgb.cv(
data = ames_x_train, # matrix with train data without sale price
label = ames_y_train, # numerical vector with sale price with train data
nrounds = 200, # number of iterations
objective = "reg:linear", # parameter referring to the function to be me minimised (RMSE)
nfold = 10, # data is randomly partitioned into n-fold equal size subsamples.
params = list( # defining the list of parameters
eta = 0.3, # learning rate
max_depth = 6, # maximal depth of tree
min_child_weight = 1, # minimum number of observations required in each terminal node
subsample = 1, # percent of training data to sample for each tree
colsample_bytree = 1 # percent of columns to sample from for each tree
),
verbose = 0 # print the statistics during the process (1 or 0)
)
[14:55:52] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:55:52] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:55:52] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:55:53] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:55:53] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:55:53] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:55:53] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:55:53] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:55:53] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:55:53] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
(eval <- ames_xgb$evaluation_log %>%
dplyr::summarise(
ntrees.train = which(train_rmse_mean == min(train_rmse_mean))[1],
rmse.train = min(train_rmse_mean),
ntrees.test = which(test_rmse_mean == min(test_rmse_mean))[1],
rmse.test = min(test_rmse_mean)))
NA
After conducting the 10-fold cross validation, we sorted the output so that it shows us at what iteration (round) our model reached the lowest error (eval$rmse.train=) when fitted to the seen part of training data and unseen part of the training data (eval$rmse.test=).
Side note: unseen part of the training data (rmse.test) has nothing to do with test data from the initial split we did at the very beginNing of the chapter and named as ames_test . Here we talk about unseen data in the process of k-fold cross-validation.
Unsurprisingly, our model performed very well when fitted to the seen data, suggesting the RMSE of eval$rmse.train. Here we see an evidence of overfitting. In other words, our model fits the training part of the training data very well (i.e. suggests low RMSE for train data), but is not generalizable, i.e. when confronted with unseen data, its predictions are not as good as for the trained part(i.e. suggests high RMSE for unseen, test data).
# Plot error vs number trees
pe<-ggplot(ames_xgb$evaluation_log) +
geom_line(aes(iter, train_rmse_mean), color = "red") +
geom_line(aes(iter, test_rmse_mean), color = "blue") +
xlab(label= "Iteration (round)")+
ylab(label = "Root Mean Square Error")+
ggtitle(label = "10-fold Cross-validation")
ggplotly(pe)
The gap between the blue and the red line in the graph above depicts the performance difference of our model when fitted to the seen data (eval$rmse.train) and when fitted to the unseen data (eval$rmse.test). Our goal is to make our model perform with unseen data as good as possible, i.e. to minimize RMSE as much as possible.
To pursue that goal, parameters, that we discussed earlier, need to be as optimally tuned as possible.
Therefore, some parameters should be adapted. For the following cross-validation process, we will slightly adapt parameters:
eta from 0.3 to 0.03early_stopping_rounds at 50max_depth to 3min_child_weight to 3subsample to 0.5colsample_bytree reduced to 0.5set.seed(1234)
ames_xgb1 <- xgb.cv(
data = ames_x_train, # matrix with train data without sale price
label = ames_y_train, # numerical vector with sale price with train data
nrounds = 2301, # number of iterations
objective = "reg:linear", # indicating regression model
early_stopping_rounds = 50, # stopping the training model as soon as evaluation metric (RMSE) does not improve for a given number of rounds
nfold = 10, # data is randomly partitioned into nfold equal size subsamples
params = list(
eta = 0.03, # learning rate
max_depth = 3, # maximal depth of tree
min_child_weight = 3, # minimum number of observations required in each terminal node
subsample = 0.5, # percent of training data to sample for each tree
colsample_bytree = 0.5 # percent of columns to sample from for each tree
),
verbose = 0
)
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
[14:56:28] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
# Checking results
(eval1<-ames_xgb1$evaluation_log %>%
dplyr::summarise(
ntrees.train = which(train_rmse_mean == min(train_rmse_mean))[1],
rmse.train = min(train_rmse_mean),
ntrees.test = which(test_rmse_mean == min(test_rmse_mean))[1],
rmse.test = min(test_rmse_mean),))
# Plot error vs number trees
pr1 <-ggplot(ames_xgb1$evaluation_log) +
geom_line(aes(iter, train_rmse_mean), color = "red") +
geom_line(aes(iter, test_rmse_mean), color = "blue") +
ggtitle(label = "10-fold Cross-validation")
ggplotly(pr1)
We can see that the training error increased. However, the error on unseen data reaches a minimum RMSE of with iterations. With simple adaptation of our parameters we managed to decrease it to some extent. At this point it should be clear that it would take incredible effort to manually compute those errors for each possible combination of parameters that would potentially decrease the error further. Luckily, there are more elegant, automated solution for it. We can create a hyperparameter search grid along with columns to dump results in. Each row of the grid contains a combination of parameters we would like to model:
# Hyperparameter grid
hyper_grid <- expand.grid(
eta = c(.01,0.3),
max_depth = c(1,3,5,7),
min_child_weight = 3,
subsample = 0.5,
colsample_bytree = 0.5,
gamma = c(0, 1, 10, 100, 1000),
lambda = c(0, 1e-2, 0.1, 1, 100, 1000, 10000),
alpha = c(0, 1e-2, 0.1, 1, 100, 1000, 10000),
rmse = 0, # a place to dump RMSE results
trees = 0 # a place to dump required number of trees
)
# Head of the hyperparameter grid
kable(head(hyper_grid))
| eta | max_depth | min_child_weight | subsample | colsample_bytree | gamma | lambda | alpha | rmse | trees |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 1 | 3 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 |
| 0.30 | 1 | 3 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 |
| 0.01 | 3 | 3 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 |
| 0.30 | 3 | 3 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 |
| 0.01 | 5 | 3 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 |
| 0.30 | 5 | 3 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 |
Besides those parameters we discussed, xgboost provides additional hyperparameters alpha, gamma and lambda that can help to constrain model complexity and reduce overfitting. We introduced them in the grid as well. With the code above we create a pretty large search grid consisting of 1960 different hyperparameter combinations to model. It is important to note that running such a grid in a loop procedure could take a couple of hours. We will create such a loop procedure to loop through and apply a xgboost model for each hyperparameter combination (1960 in our case) and finally provide us the results in the hyper_grid data frame.
# Grid search
for(i in seq_len(nrow(hyper_grid))) {
set.seed(1234)
m <- xgb.cv(
data = ames_x_train,
label = ames_y_train,
nrounds = 4000,
objective = "reg:linear",
early_stopping_rounds = 50,
nfold = 10,
verbose = 0,
params = list(
eta = hyper_grid$eta[i],
max_depth = hyper_grid$max_depth[i],
min_child_weight = hyper_grid$min_child_weight[i],
subsample = hyper_grid$subsample[i],
colsample_bytree = hyper_grid$colsample_bytree[i],
gamma = hyper_grid$gamma[i],
lambda = hyper_grid$lambda[i],
alpha = hyper_grid$alpha[i]
)
)
hyper_grid$rmse[i] <- min(m$evaluation_log$test_rmse_mean)
hyper_grid$trees[i] <- m$best_iteration
}
# Results
hyper_grid %>%
filter(rmse > 0) %>%
arrange(rmse) %>%
glimpse()
Here is a glimpse of results we obtained after several hours of processing:
## Observations: 98
## Variables: 10
## $ eta <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
## $ max_depth <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ min_child_weight <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ subsample <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5…
## $ colsample_bytree <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5…
## $ gamma <dbl> 0, 1, 10, 100, 1000, 0, 1, 10, 10…
## $ lambda <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ alpha <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.1…
## $ rmse <dbl> 20488, 20488, 20488, 20488, 20488…
## $ trees <dbl> 3944, 3944, 3944, 3944, 3944, 381…
In the first “column” we see the combination of parameters given that results in the lowest estimated error (RMSE) possible for the combination given. Subsequently, we will use those parameters to enhance prediction performance of our model.
# The list of optimal hyperparameters
params_optimal <- list(
eta = 0.01,
max_depth = 3,
min_child_weight = 3,
subsample = 0.5,
colsample_bytree = 0.5,
lambda = 1
)
# Train final model with optimal combination of the given parameters
set.seed(1234)
xgb.fit.optimal <- xgboost(
params = params_optimal,
data = ames_x_train,
label = ames_y_train,
nrounds = 3944,
objective = "reg:linear",
verbose = 0
)
[14:58:20] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
After computing the final model, we can make inferences about how features (i.e. variables in our data set besides sale price) are influencing our model. Measurement of feature importance occurs based on the sum of the reduction in the loss function (e.g. SSE) attributed to each variable at each split in a respective tree. In other words, it is the relative contribution of the respective feature to the model computed by taking each feature’s contribution for each tree in the model. Therefore, those features with the highest average decrease in SSE (for regression) are identified as the one with the highest contribution. Thus, these features are among most important ones. To visualize feature importance plot we need to create importance matrix first then plot it with ggplot-based function “xgb.ggplot.importance”.
# Construct importance matrix
importance_matrix <- xgb.importance(model = xgb.fit.optimal)
# Variable importance plot with ggplot2
xgb.ggplot.importance(importance_matrix, top_n = 15)
As we identified the most relevant features, now we can try to understand how the response variable (i.e. predicted sale price) changes based on these variables. For this we can use partial dependence plots (PDPs). They show the marginal effect one or two features have on the predicted outcome. Let’s consider the gr_liv_area variable. The PDP plot below displays the average change in predicted sales price as we vary gr_liv_area while holding all other variables constant. More specifically, it shows the movement of the predicted sales price as the square footage of the ground floor in a house changes, while holding other variables constant. It is important to mention that PDPs are valid as long as the target variable (sale price) and the variable under observation (gr_liv_area) are not correlated. If this assumption is violated, the averages calculated for the partial dependence plot will include data points that are very unlikely or even impossible.
library(pdp)
pdp <- xgb.fit.optimal %>%
pdp::partial(pred.var = "gr_liv_area", n.trees = 3944, grid.resolution = 100, train = ames_x_train) %>%
autoplot(rug = TRUE, train = ames_x_train) +
scale_y_continuous(labels = scales::dollar) +
ggplot2::xlab(label= "Ground floor living area")+
ggplot2::ylab(label="Predicted sale price")+
ggtitle("PDP - Influence of the ground floor size on a house sale price")
ggplotly(pdp)
We use predict function to predict unseen observations from the test data set we created at the beginning of the chapter. Since we already know the real sale prices from the test data set, we will be able to calculate the error of our predictive model.
# Predict values for test data optimal
pred.optimal <- predict(xgb.fit.optimal, ames_x_test)
# Results with optimal parameters
RMSE(pred.optimal, ames_test[,308])
[1] 21988.76
Finally, we can nicely visualize predicted and actual sale price.
# Plot predictions vs actual sale price
pred.optimal <- as.data.frame(pred.optimal)
ames_test_pred<- data.frame(ames_test[,308],pred.optimal)
colnames(ames_test_pred) <- c("sale_price","pred.optimal")
p<-ggplot(ames_test_pred, aes(x = pred.optimal, y = sale_price, Predicted=pred.optimal, Tested=sale_price)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE)+
xlab(label = "Predicted sale price" )+
ylab(label = "Test sale price")+
ggtitle(label = "Predicted sale price vs sales price")
ggplotly(p,tooltip = c("Predicted","Tested"))